Abstract

We consider the linear regression problem, where the number $p$ of covariates
is possibly larger than the number $n$ of observations $(x_i, y_i)_{1 \le i \le n}$,
under sparsity assumptions. On the one hand, several methods have been
successfully proposed to perform this task, for example the LASSO in
[Tib96] or the Dantzig Selector in [CT07]. On the other hand, consider
new values $(x_i)_{n+1 \le i \le m}$. If one wants to estimate the corresponding
$y_i$'s, one should think of a specific estimator devoted to this task, referred
to in [Vap98] as a "transductive" estimator. This estimator may differ from an
estimator designed for the more general task "estimate on the whole domain".
In this work, we propose a generalized version of both the LASSO and the
Dantzig Selector, based on the geometrical remarks about the LASSO in
[Alq08], [AH08]. The "usual" LASSO and Dantzig Selector, as well as new
estimators interpreted as transductive versions of the LASSO, appear as
special cases. These estimators are interesting at least from a theoretical
point of view: we can give theoretical guarantees for these estimators under
hypotheses that are relaxed versions of the hypotheses required in the papers
about the "usual" LASSO. These estimators can also be efficiently computed,
with results comparable to those of the LASSO.
1 Introduction

In many modern applications, a statistician often has to deal with very large
datasets. Regression problems may involve a large number of covariates $p$,
possibly larger than the sample size $n$. In this situation, a major issue is
dimension reduction, which can be performed through the selection of a small
number of relevant covariates. For this purpose, numerous regression methods
have been proposed in the literature, ranging from the classical information
criteria such as AIC [Aka73] and BIC [Sch78] to the more recent sparse methods,
known as the LASSO [Tib96] and the Dantzig Selector [CT07]. Regularized
regression methods have recently witnessed several developments due to the
attractive feature of computational feasibility, even for high dimensional data
(i.e., when the number of covariates $p$ is large). We focus on the usual linear
regression model:

    $y_i = x_i \beta^* + \varepsilon_i, \quad i = 1, \dots, n,$    (1)

where the design $x_i = (x_{i,1}, \dots, x_{i,p}) \in \mathbb{R}^p$ is deterministic,
$\beta^* = (\beta^*_1, \dots, \beta^*_p)' \in \mathbb{R}^p$ is the unknown parameter and
$\varepsilon_1, \dots, \varepsilon_n$ are i.i.d. centered Gaussian random variables with
known variance $\sigma^2$. Let $X$ denote the matrix with $i$-th line equal to $x_i$,
and let $X_j$ denote its $j$-th column, with $i \in \{1, \dots, n\}$ and
$j \in \{1, \dots, p\}$. So:

    $X = (x_1', \dots, x_n')' = (X_1, \dots, X_p).$

For the sake of simplicity, we will assume that the observations are normalized
in such a way that $X_j' X_j / n = 1$. We denote by $Y$ the vector
$Y = (y_1, \dots, y_n)'$. For all $\alpha \ge 1$ and any vector $v \in \mathbb{R}^d$,
we set $\|\cdot\|_\alpha$, the norm:
$\|v\|_\alpha = (|v_1|^\alpha + \dots + |v_d|^\alpha)^{1/\alpha}$. In particular
$\|\cdot\|_2$ is the euclidean norm. Moreover for all $d \in \mathbb{N}$, we use the
notation $\|v\|_0 = \sum_{i=1}^{d} 1(v_i \ne 0)$.
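The two quantities just defined can be sketched in a few lines of Python. This is only an illustration of the notation: the function names `lp_norm` and `l0_count` are hypothetical, and the example vector is chosen here for illustration.

```python
def lp_norm(v, alpha):
    """Return ||v||_alpha = (sum_i |v_i|^alpha)^(1/alpha), for alpha >= 1."""
    if alpha < 1:
        raise ValueError("alpha must be at least 1 for ||.||_alpha to be a norm")
    return sum(abs(x) ** alpha for x in v) ** (1.0 / alpha)


def l0_count(v):
    """Return ||v||_0, the number of non-zero coordinates of v."""
    return sum(1 for x in v if x != 0)


# Illustrative sparse vector (an assumption of this sketch).
beta = [3, 1.5, 0, 0, 2, 0, 0, 0]
print(lp_norm(beta, 1))  # l1 norm
print(lp_norm(beta, 2))  # euclidean norm
print(l0_count(beta))    # number of non-zero coordinates
```

The count $\|v\|_0$ is not a norm; it is the sparsity index that appears in the bounds discussed below.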

The problem of estimating the regression parameter in the high dimensional
setting has been extensively studied in the statistical literature. Among others,
the LASSO [Tib96] (denoted by $\hat\beta^{L}$), the Dantzig Selector [CT07] (denoted
by $\hat\beta^{DS}$) and the non-negative garrote (in Yuan and Lin [YL07]) have been
proposed to deal with this problem for a large $p$, even for $p > n$. These
estimators give very good practical results. For instance in [Tib96], simulations
and tests on real data have been provided for the LASSO. We also refer to
[Kol07], [Kol09], [MVdGB08], [vdG08], [DT07], [CH08] for related work with
different estimators: non-quadratic loss, penalties slightly different from
$\ell_1$ and random design.

From a theoretical point of view, Sparsity Inequalities (SI) have been proved for
these estimators under different assumptions. That is, upper bounds of order
$O(\sigma^2 \|\beta^*\|_0 \log(p)/n)$ for the errors $(1/n)\|X\hat\beta - X\beta^*\|_2^2$
and $\|\hat\beta - \beta^*\|_2^2$ have been derived, where $\hat\beta$ is one of the
estimators mentioned above. In particular these bounds involve the number of
non-zero coordinates in $\beta^*$ (multiplied by $\log(p)$), instead of the dimension
$p$. Such bounds guarantee that under some assumptions, $X\hat\beta$ and $\hat\beta$ are
good estimators of $X\beta^*$ and $\beta^*$ respectively. For the LASSO $\hat\beta^{L}$,
these SI are given for example in [BTW07], [BRT07], whereas [CT07], [BRT07]
provided SI for the Dantzig Selector $\hat\beta^{DS}$. On the other hand, Bunea [Bun08]
establishes conditions which ensure that $\hat\beta^{L}$ and $\beta^*$ have the same null
coordinates. Analogous results for $\hat\beta^{DS}$ can be found in [Lou08].

Now, let us assume that we are given additional observations $x_i \in \mathbb{R}^p$
for $n+1 \le i \le m$ (with $m > n$), and introduce the matrix
$Z = (x_1', \dots, x_m')'$. Assume that the objective of the statistician is
precisely to estimate $Z\beta^*$: namely, he cares about predicting what would be
the labels attached to the additional $x_i$'s. It is argued in [Vap98] that in
such a case a specific estimator devoted to this task should be considered: the
transductive estimator. This estimator differs from an estimator tailored for
the estimation of $\beta^*$ or $X\beta^*$ like the LASSO. Indeed one usually builds an
estimator $\hat\beta(X, Y)$ and then computes $Z\hat\beta(X, Y)$ to estimate $Z\beta^*$.
The approach taken here is to consider estimators $\hat\beta(X, Y, Z)$ exploiting
the knowledge of $Z$, and then to compute $Z\hat\beta(X, Y, Z)$.

Some methods in supervised classification or regression were successfully
extended to the transductive setting, such as the well-known Support Vector
Machines (SVM) in [Vap98] and the Gibbs estimators in [Cat07]. It is argued in
the semi-supervised learning literature (see for example [CSZ06] for a recent
survey) that taking into account the information on the design given by the
new additional $x_i$'s has a stabilizing effect on the estimator.

In this paper, we study a family of estimators which generalizes the LASSO
and the Dantzig Selector. The considered family depends on a $q \times p$ matrix $A$,
with $q \in \mathbb{N}$, whose choice allows one to adapt the estimator to the
objective of the statistician. In particular, a suitable choice of the matrix
$A$ covers the transductive setting.

The rest of the paper is organized as follows. In the next section, we motivate
the use of the studied family of estimators through geometrical considerations
stated in [AH08]. In Sections 3 and 4 we establish Sparsity Inequalities for
these estimators. A discussion on the assumptions needed to prove the SI is
also provided. In particular, it is shown that the estimators devoted to the
transductive setting satisfy these SI with weaker assumptions than those needed
by the LASSO or the Dantzig Selector when $m > p > n$, that is, when the
number of new points is large enough. The implementation of our estimators
and some numerical experiments are the purpose of Section 5. The results
clearly show that the use of a transductive version of the LASSO may improve
the performance of the estimation. All proofs of the theoretical results are
postponed to Section 7.
2 Preliminaries



In this section we state geometrical considerations (projections on a confidence 
region) for the LASSO and the Dantzig Selector. These motivate the introduc- 
tion of our estimators. Finally we discuss the different objectives considered in 
this paper. 

Let us recall that a definition of the LASSO estimate is given by

    $\arg\min_{\beta \in \mathbb{R}^p} \{ \|Y - X\beta\|_2^2 + 2\lambda \|\beta\|_1 \}.$    (2)

A dual form (in [OPT00]) of this program is also of interest:

    $\arg\min_{\beta \in \mathbb{R}^p} \|X\beta\|_2^2$  s.t.  $\|X'(Y - X\beta)\|_\infty \le \lambda;$    (3)

actually it is proved in [Alq08] that any solution of Program (3) is a solution of
Program (2) and that the set $\{X\beta\}$ is the same where $\beta$ is taken among all the
solutions of Program (2) or among all the solutions of Program (3). So both
programs are equivalent in terms of estimating $X\beta^*$.

Now, let us recall the definition of the Dantzig Selector:

    $\arg\min_{\beta \in \mathbb{R}^p} \|\beta\|_1$  s.t.  $\|X'(Y - X\beta)\|_\infty \le \lambda.$    (4)

Alquier [Alq08] observed that both Programs (3) and (4) can be seen as a
projection of the null vector $0_p$ onto the region
$\{\beta : \|X'(Y - X\beta)\|_\infty \le \lambda\}$ that can be interpreted as a confidence
region, with confidence $1 - \eta$, for a given $\lambda$ that depends on $\eta$ (see
Lemma 7.1 here for example). The difference between the two programs is the
distance (or semi-distance) used for the projection.
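To make the confidence region concrete, here is a minimal Python sketch that tests whether a candidate $\beta$ satisfies $\|X'(Y - X\beta)\|_\infty \le \lambda$. The helper names and the tiny data set are assumptions of this sketch, and matrices are plain lists of rows.

```python
def mat_vec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def transpose(M):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]


def in_confidence_region(X, Y, beta, lam):
    """Check the constraint ||X'(Y - X beta)||_inf <= lam."""
    residual = [y - fitted for y, fitted in zip(Y, mat_vec(X, beta))]
    correlations = mat_vec(transpose(X), residual)
    return max(abs(c) for c in correlations) <= lam


# Toy data (illustrative only).
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [2.0, -1.0]
print(in_confidence_region(X, Y, [0.0, 0.0], 1.5))
print(in_confidence_region(X, Y, [1.0, 0.0], 1.5))
```

Programs (3) and (4) then differ only in which point of this region they select: the one minimizing $\|X\beta\|_2^2$ or the one minimizing $\|\beta\|_1$.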

Based on these geometrical considerations, we proposed in [AH08] to study
the following transductive estimator:

    $\arg\min_{\beta \in \mathbb{R}^p} \|Z\beta\|_2^2$  s.t.  $\|X'(Y - X\beta)\|_\infty \le \lambda;$    (5)

that is a projection on the same confidence region, but using a distance adapted
to the transductive estimation problem. We proved a Sparsity Inequality for
this estimator exploiting a novel sparsity measure.

In this paper, we propose a generalized version of the LASSO and of the
Dantzig Selector, based on the same geometrical remark. More precisely, for
$q \in \mathbb{N}^*$, let $A$ be a $q \times p$ matrix. We propose two general estimators,
$\hat\beta_{A,\lambda}$ (extension of the LASSO, based on a generalization of Program (2))
and $\tilde\beta_{A,\lambda}$ (transductive Dantzig Selector, generalization of Program (4)).
These novel estimators depend on two tuning parameters: $\lambda > 0$ is a
regularization parameter, which plays the same role as the tuning parameter
involved in the LASSO, and the matrix $A$, which allows one to adapt the estimator
to the objective of the statistician. More particularly, depending on the choice
of the matrix $A$, this estimator can be adapted to one of the following objectives:

• denoising objective: the estimation of $X\beta^*$, that is a denoised version
of $Y$. For this purpose, we consider the estimator $\hat\beta_{A,\lambda}$, with $A = X$.
In this case, the estimator will actually be equal to the LASSO $\hat\beta^{L}$, and
$\tilde\beta_{A,\lambda}$ with the same choice $A = X$ will be equal to the Dantzig Selector;

• transductive objective: the estimation of $Z\beta^*$, by $\hat\beta_{A,\lambda}$ or
$\tilde\beta_{A,\lambda}$, with $A = \sqrt{n/m}\, Z$. We will refer to the corresponding
estimators as the "Transductive LASSO" and "Transductive Dantzig Selector";

• estimation objective: the estimation of $\beta^*$ itself, by $\hat\beta_{A,\lambda}$, with
$A = \sqrt{n}\, I$. In this case, it appears that both estimators are well defined
only in the case $p < n$ and are equal to a soft-thresholded version of the usual
least-square estimator.

For both estimators and all the above objectives, we prove SI (Sparsity 
Inequalities). Moreover, we show that these estimators can easily be computed. 



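3 The "easy case": Ker(X) = Ker(Z)

In this section, we deal with the "easy case", where Ker(A) = Ker(X) (think of
$A = X$, $A = \sqrt{n}\, I$ or $A = \sqrt{n/m}\, Z$). This setting is natural at least in
the case $p < n$ where both kernels are equal to $\{0\}$ in general. We provide SI
(Sparsity Inequality, Theorem 3.3) for the studied estimators, based on the
techniques developed in [BRT07].

3.1 Definition of the estimators

Definition 3.1. For a given parameter $\lambda > 0$ and any matrix $A$ such that
Ker(A) = Ker(X), we consider the estimator given by

    $\hat\beta_{A,\lambda} \in \arg\min_{\beta \in \mathbb{R}^p} \{ -2 Y'X (\widetilde{X'X})^{-1} (A'A) \beta + \beta'(A'A)\beta + 2\lambda \|\Xi_A \beta\|_1 \},$

where $(\widetilde{X'X})^{-1}$ is exactly $(X'X)^{-1}$ if $(X'X)$ is invertible, and any
pseudo-inverse of this matrix otherwise, and where $\Xi_A$ is a diagonal matrix whose
$(j,j)$-th coefficient is $\xi_j^{1/2}(A)$, with
$\xi_j(A) = \frac{1}{n} [(A'A)(\widetilde{X'X})^{-1}(A'A)]_{j,j}$.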

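Remark 3.1. Equivalently we have

    $\hat\beta_{A,\lambda} \in \arg\min_{\beta \in \mathbb{R}^p} \{ \|\tilde Y_A - A\beta\|_2^2 + 2\lambda \|\Xi_A \beta\|_1 \},$

where $\tilde Y_A = A (\widetilde{X'X})^{-1} X'Y$.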
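The equivalence stated in Remark 3.1 can be checked by expanding the square. The short derivation below is a sketch that assumes the pseudo-inverse $(\widetilde{X'X})^{-1}$ is chosen symmetric, so that $\tilde Y_A' = Y'X(\widetilde{X'X})^{-1}A'$:

```latex
\begin{align*}
\|\tilde Y_A - A\beta\|_2^2
  &= \tilde Y_A'\tilde Y_A - 2\,\tilde Y_A' A\beta + \beta' A'A \beta \\
  &= \tilde Y_A'\tilde Y_A
     - 2\,Y'X(\widetilde{X'X})^{-1}(A'A)\beta + \beta'(A'A)\beta .
\end{align*}
```

The first term does not depend on $\beta$, so adding the penalty $2\lambda\|\Xi_A\beta\|_1$ to either expression yields the same set of minimizers.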
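Actually, we are going to consider three particular cases of this estimator in
this work, depending on the objective of the statistician:

• denoising objective: the LASSO, denoted here by $\hat\beta_{X,\lambda}$, given by

    $\hat\beta_{X,\lambda} \in \arg\min_{\beta \in \mathbb{R}^p} \{ \|Y - X\beta\|_2^2 + 2\lambda\|\beta\|_1 \}$
    $= \arg\min_{\beta \in \mathbb{R}^p} \{ -2Y'X\beta + \beta'X'X\beta + 2\lambda\|\beta\|_1 \}$

(note that in this case, $\Xi_X = I$ since $X$ is normalized);

• transductive objective: the Transductive LASSO, denoted here by
$\hat\beta_{\sqrt{n/m} Z,\lambda}$, given by

    $\hat\beta_{\sqrt{n/m} Z,\lambda} \in \arg\min_{\beta \in \mathbb{R}^p} \{ \frac{n}{m} \|\tilde Y_Z - Z\beta\|_2^2 + 2\lambda \|\Xi_{\sqrt{n/m} Z}\, \beta\|_1 \};$

• estimation objective: $\hat\beta_{\sqrt{n} I,\lambda}$, defined by

    $\hat\beta_{\sqrt{n} I,\lambda} \in \arg\min_{\beta \in \mathbb{R}^p} \{ n \|\tilde Y_I - \beta\|_2^2 + 2\lambda \|\Xi_{\sqrt{n} I}\, \beta\|_1 \}.$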

Let us give the analogous definition for an extension of the Dantzig Selector. 

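Definition 3.2. For a given parameter $\lambda > 0$ and any matrix $A$ such that
Ker(A) = Ker(X), we consider the estimator given by

    $\tilde\beta_{A,\lambda} = \arg\min_{\beta \in \mathbb{R}^p} \|\beta\|_1$  s.t.  $\|\Xi_A^{-1} A'A ((\widetilde{X'X})^{-1} X'Y - \beta)\|_\infty \le \lambda.$    (6)

Here again, we are going to consider three cases, for $A = X$, $A = \sqrt{n/m}\, Z$
and $A = \sqrt{n}\, I$, and it is easy to check that for $A = X$ we have exactly the
usual definition of the Dantzig Selector (Program (4)). Moreover, here again, note
that we can rewrite this estimator, with $\tilde Y_A = A(\widetilde{X'X})^{-1}X'Y$ as in
Remark 3.1:

    $\tilde\beta_{A,\lambda} = \arg\min_{\beta \in \mathbb{R}^p} \|\beta\|_1$  s.t.  $\|\Xi_A^{-1} A' (\tilde Y_A - A\beta)\|_\infty \le \lambda.$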



The following proposition provides an interpretation of our estimators when
$A = \sqrt{n}\, I$.

Proposition 3.1. Let us assume that $(X'X)$ is invertible. Then
$\hat\beta_{\sqrt{n} I,\lambda} = \tilde\beta_{\sqrt{n} I,\lambda}$ and this is a soft-thresholded
least-square estimator: let us put $\hat\beta^{LSE} = (X'X)^{-1}X'Y$, then
$\hat\beta_{\sqrt{n} I,\lambda}$ is the vector obtained by replacing the $j$-th coordinate
$b_j = \hat\beta^{LSE}_j$ of $\hat\beta^{LSE}$ by
$\mathrm{sgn}(b_j) (|b_j| - \lambda \xi_j^{1/2}(\sqrt{n} I)/n)_+$, where we use the standard
notation $\mathrm{sgn}(x) = +1$ if $x \ge 0$, $\mathrm{sgn}(x) = -1$ if $x < 0$ and
$(x)_+ = \max(x, 0)$.
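The soft-thresholding rule of Proposition 3.1 can be sketched as follows. The function names are hypothetical, and the per-coordinate thresholds (which play the role of $\lambda \xi_j^{1/2}(\sqrt{n} I)/n$) are passed in as plain numbers, an assumption made to keep the sketch free of matrix inversion.

```python
def soft_threshold(b, t):
    """Return sgn(b) * max(|b| - t, 0)."""
    shrunk = max(abs(b) - t, 0.0)
    return shrunk if b >= 0 else -shrunk


def soft_thresholded_lse(beta_lse, thresholds):
    """Apply coordinate-wise soft-thresholding to a least-square estimate.

    beta_lse   : coordinates b_j of the least-square estimator
    thresholds : non-negative thresholds t_j, one per coordinate
    """
    return [soft_threshold(b, t) for b, t in zip(beta_lse, thresholds)]


# Illustrative values (not taken from the experiments).
print(soft_thresholded_lse([3.0, -0.4, 1.2], [1.0, 1.0, 1.0]))
```

Coordinates whose absolute value is below the threshold are set to zero, which is how this estimator produces sparse solutions.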
Proposition 3.2 deals with a dual definition of the estimator $\hat\beta_{A,\lambda}$.

Proposition 3.2. When Ker(A) = Ker(X), the solutions $\beta$ of the following
program:

    $\arg\min_{\beta \in \mathbb{R}^p} \|A\beta\|_2^2$  s.t.  $\|\Xi_A^{-1} A' (\tilde Y_A - A\beta)\|_\infty \le \lambda$

all satisfy $X\beta = X\hat\beta_{A,\lambda}$ and $A\beta = A\hat\beta_{A,\lambda}$.

Proofs can be found in Section 7.1.



3.2 Theoretical results 

Let us first introduce our main assumption. This assumption is stated with a 
given p x p matrix M and a given real number x > 0. 

Assumption H(M, x): there is a constant $c(M) > 0$ such that, for any
$\alpha \in \mathbb{R}^p$ such that
$\sum_{j:\beta^*_j = 0} \xi_j(M) |\alpha_j| \le x \sum_{j:\beta^*_j \ne 0} \xi_j(M) |\alpha_j|$
we have

    $\alpha' M \alpha \ge c(M)\, n \sum_{j:\beta^*_j \ne 0} \alpha_j^2.$    (7)

First, let us explain briefly the meaning of this hypothesis. In the case
where $M$ is invertible, the condition

    $\alpha' M \alpha \ge c(M)\, n \sum_{j:\beta^*_j \ne 0} \alpha_j^2$

is always satisfied for any $\alpha \in \mathbb{R}^p$ with $c(M)$ larger than the smallest
eigenvalue of $M/n$. However, for the LASSO, we have $M = (X'X)$ and $M$ cannot be
invertible if $p > n$. Even in this case, Assumption H(M, x) may still be
satisfied. Indeed, the assumption requires that Inequality (7) holds only for a
small subset of $\mathbb{R}^p$ determined by the condition
$\sum_{j:\beta^*_j = 0} \xi_j(M) |\alpha_j| \le x \sum_{j:\beta^*_j \ne 0} \xi_j(M) |\alpha_j|$.
For $M = (X'X)$, this assumption becomes exactly the one taken in [BTW07]. In that
paper, the necessity of such a hypothesis is also discussed.

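Theorem 3.3. Let us assume that Assumption H(A'A, 3) is satisfied and that
Ker(A) = Ker(X). Let us choose $0 < \eta < 1$ and $\lambda = 2\sigma\sqrt{2n \log(p/\eta)}$.
With probability at least $1 - \eta$ on the draw of $Y$, we have simultaneously

    $\|A(\hat\beta_{A,\lambda} - \beta^*)\|_2^2 \le \frac{72\sigma^2}{c(A'A)} \log\left(\frac{p}{\eta}\right) \sum_{j:\beta^*_j \ne 0} \xi_j(A)$

and

    $\|\Xi_A(\hat\beta_{A,\lambda} - \beta^*)\|_1 \le \frac{24\sqrt{2}\,\sigma}{c(A'A)} \left(\frac{\log(p/\eta)}{n}\right)^{1/2} \sum_{j:\beta^*_j \ne 0} \xi_j(A).$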
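In particular, the first inequality gives:

• if Assumption H(X'X, 3) is satisfied, with probability at least $1 - \eta$,

    $\frac{1}{n} \|X(\hat\beta_{X,\lambda} - \beta^*)\|_2^2 \le \frac{72\sigma^2}{n\, c(X'X)} \|\beta^*\|_0 \log\left(\frac{p}{\eta}\right);$

• if Assumption H((n/m)Z'Z, 3) is satisfied, and if Ker(Z) = Ker(X), with
probability at least $1 - \eta$,

    $\frac{1}{m} \|Z(\hat\beta_{\sqrt{n/m} Z,\lambda} - \beta^*)\|_2^2 \le \frac{72\sigma^2}{n\, c((n/m)Z'Z)} \log\left(\frac{p}{\eta}\right) \sum_{j:\beta^*_j \ne 0} \xi_j(\sqrt{n/m}\, Z);$

• and if $(X'X)$ is invertible, with probability at least $1 - \eta$,

    $\|\hat\beta_{\sqrt{n} I,\lambda} - \beta^*\|_2^2 \le \frac{72\sigma^2}{n\, c(nI)} \sum_{j:\beta^*_j \ne 0} \xi_j(\sqrt{n}\, I) \log\left(\frac{p}{\eta}\right).$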

This result shows that each of these three estimators satisfies at least one SI
for the task it is designed for. For example, the LASSO is proved to have "good"
performance for the estimation of $X\beta^*$ and the Transductive LASSO is proved
to have good performance for the estimation of $Z\beta^*$. However we cannot assert
that, for example, the LASSO performs better than the Transductive LASSO
for the estimation of $Z\beta^*$.
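The regularization level used in Theorem 3.3, $\lambda = 2\sigma\sqrt{2n\log(p/\eta)}$, is easy to evaluate numerically. The following sketch only computes that formula; the function name and the example values are assumptions of the sketch.

```python
import math


def regularization_level(sigma, n, p, eta):
    """Return lambda = 2 * sigma * sqrt(2 * n * log(p / eta))."""
    if not 0 < eta < 1:
        raise ValueError("eta must lie strictly between 0 and 1")
    return 2.0 * sigma * math.sqrt(2.0 * n * math.log(p / eta))


# Example: 7 observations, 8 covariates, confidence level 1 - eta = 0.95.
print(regularization_level(sigma=1.0, n=7, p=8, eta=0.05))
```

The level grows only logarithmically with the number of covariates $p$, which is what makes the bounds usable when $p$ is larger than $n$.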

Remark 3.2. For $A = X$, the particular case of our result applied to the LASSO
is quite similar to the result given in [BTW07] on the LASSO. Actually, Theorem
3.3 can be seen as a generalization of the result in [BTW07] and it should be
noted that the proof of Theorem 3.3 uses arguments introduced in [BTW07].

Remark 3.3. As soon as $A'A$ is better determined than $X'X$, Assumption
H(A'A, x) is less restrictive than H(X'X, x). In particular, in the case where
$m > n$, Assumption H((n/m)Z'Z, x) is expected to be less restrictive than
Assumption H(X'X, x).

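Now we give the analogous result for the estimator $\tilde\beta_{A,\lambda}$.

Theorem 3.4. Let us assume that Assumption H(A'A, 1) is satisfied and that
Ker(A) = Ker(X). Let us choose $0 < \eta < 1$ and $\lambda = 2\sigma\sqrt{2n \log(p/\eta)}$.
With probability at least $1 - \eta$ on the draw of $Y$, we have simultaneously

    $\|A(\tilde\beta_{A,\lambda} - \beta^*)\|_2^2 \le \frac{72\sigma^2}{c(A'A)} \log\left(\frac{p}{\eta}\right) \sum_{j:\beta^*_j \ne 0} \xi_j(A)$

and

    $\|\Xi_A(\tilde\beta_{A,\lambda} - \beta^*)\|_1 \le \frac{12\sqrt{2}\,\sigma}{c(A'A)} \left(\frac{\log(p/\eta)}{n}\right)^{1/2} \sum_{j:\beta^*_j \ne 0} \xi_j(A).$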
4 An extension to the general case

In this section, we only deal with the transductive setting, $A = \sqrt{n/m}\, Z$. Let
us recall that in such a framework, we observe $X$ which consists of some
observations $x_i$ associated to labels $Y_i$ in $Y$, for $i \in \{1, \dots, n\}$. Moreover
we have additional observations $x_i$ for $i \in \{n+1, \dots, m\}$ with $m > n$. We also
recall that $Z$ contains all the $x_i$ for $i \in \{1, \dots, m\}$ and that the objective is
to estimate the corresponding labels $Y_i$; let us put $\overline{Y} = (Y_1, \dots, Y_m)'$.



4.1 General remarks 

Let us have a look at the definition of $\hat\beta_{\sqrt{n/m} Z,\lambda}$, for example as
given in Remark 3.1:

    $\hat\beta_{\sqrt{n/m} Z,\lambda} \in \arg\min_{\beta \in \mathbb{R}^p} \{ \frac{n}{m} \|\tilde Y_Z - Z\beta\|_2^2 + 2\lambda \|\Xi_{\sqrt{n/m} Z}\, \beta\|_1 \},$

where actually $\tilde Y_Z = Z (\widetilde{X'X})^{-1} X'Y$ can be interpreted as a preliminary
estimator of $\overline{Y}$. Hence, in any case, we propose the following procedure.
Let us assume that, depending on the context, the user has a natural (and not
necessarily efficient) estimator of $\overline{Y} = (Y_1, \dots, Y_m)'$. Denote this
estimator by $\hat Y$.



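Definition 4.1. The Transductive LASSO is given by:

    $\hat\beta_{\hat Y, \sqrt{n/m} Z, \lambda} \in \arg\min_{\beta \in \mathbb{R}^p} \{ \frac{n}{m} \|\hat Y - Z\beta\|_2^2 + 2\lambda \|\Xi_{\sqrt{n/m} Z}\, \beta\|_1 \},$

and the Transductive Dantzig Selector is defined as:

    $\tilde\beta_{\hat Y, \sqrt{n/m} Z, \lambda} = \arg\min_{\beta \in \mathbb{R}^p} \|\beta\|_1$  s.t.  $\|\frac{n}{m} \Xi_{\sqrt{n/m} Z}^{-1} Z'(\hat Y - Z\beta)\|_\infty \le \lambda.$

In the next subsection, we propose a context where we have a natural
estimator $\hat Y$ and give a SI on this estimator.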



4.2 An example: small labeled dataset, large unlabeled 
dataset 

The idea of this example is to consider the case where the examples $x_i$ for
$1 \le i \le n$ are "representative" of the large population $x_i$ for $1 \le i \le m$.

Consider $Z = (x_1', \dots, x_m')'$ where the $x_i$'s are the points of interest: we
want to estimate $Z\beta^*$. However, we just have a very expensive and noisy
procedure that, given a point $x_i$, returns $Y_i = x_i\beta^* + \varepsilon_i$ where the
$\varepsilon_i$'s are independent $\mathcal{N}(0, \sigma^2)$ random variables. In such a case,
the procedure cannot be applied to the whole dataset $Z = (x_1', \dots, x_m')'$. We can
only apply it to a "representative" sample of size $n$. A typical case could be
$n < p < m$.
First, let us introduce a slight modification of our main hypothesis. It is also
stated with a given $p \times p$ matrix $M$ and a given real number $x > 0$.

Assumption H'(M, x): there is a $c(M) > 0$ such that, for any $\alpha \in \mathbb{R}^p$
such that $\sum_{j:\beta^*_j = 0} |\alpha_j| \le x \sum_{j:\beta^*_j \ne 0} |\alpha_j|$ we have

    $\alpha' M \alpha \ge c(M)\, n \sum_{j:\beta^*_j \ne 0} \alpha_j^2.$

We can now state our main result. 

Theorem 4.1. Let us assume that Assumption H'((n/m)Z'Z, 1) is satisfied.
Let us choose $0 < \eta < 1$ and $\lambda_1 = \lambda_2 = 10^{-1} \sigma \sqrt{2n \log(p/\eta)}$.
Moreover, let us assume that



< Sf nl0S U 



Vti e R p tcn£ft ||w||i < H/Hli, f(X'X) - -(Z'Z)) 

\ in J 

(8) 

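Let $\hat Y_{\lambda_1} = Z\tilde\beta_{X,\lambda_1}$ be a preliminary estimator of the labels,
based on the Dantzig Selector given by (6) (with $A = X$). Then define the
Transductive LASSO by

    $\hat\beta^*_{Z,20\lambda_2} = \arg\min_{\beta \in \mathbb{R}^p} \frac{n}{m} \|Z\beta\|_2^2$  s.t.  $\|\frac{n}{m} Z'(\hat Y_{\lambda_1} - Z\beta)\|_\infty \le 20\lambda_2,$

and the Transductive Dantzig Selector by

    $\tilde\beta^*_{Z,\lambda_2} = \arg\min_{\beta \in \mathbb{R}^p} \|\beta\|_1$  s.t.  $\|\frac{n}{m} Z'(\hat Y_{\lambda_1} - Z\beta)\|_\infty \le \lambda_2.$

With probability at least $1 - \eta$ on the draw of $Y$, we have simultaneously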



3* 



z,\ 2 



1 

m 



Z{f3*, 



31 



z.\ 2 



z,\ 2 



< 



2 < /f 7 * log ( P - )Wh, 
2 nc([n/m)Z Z) \rj 1 



8(7 



1 c((n/m)Z'Z) \ n J 



log {p/rj) V 



I/Silo, 



moreover, if H'((n/m)Z'Z, 5) is also satisfied,



1 

m 



< 



^ io g f£ 

2 nc((n/m)Z' Z) \rj 

1 

54(7 f log (p/rf) 



o, 



1 c((n/m)Z' Z) \ n 
First, let us remark that the preliminary estimator $\hat Y_{\lambda_1}$ is defined using
the Dantzig Selector $\tilde\beta_{X,\lambda_1}$. We could give exactly the same kind of
results using the LASSO $\hat\beta_{X,\lambda_1}$ as a preliminary estimator.

Now, let us have a look at the new hypothesis, Inequality (8). We can
interpret this condition as the fact that the $x_i$'s for $1 \le i \le n$ are effectively
representative of the wide population: so $X'X/n$ is "not too far" from $Z'Z/m$.
We will end this section with a result that proves that this is effectively the
case in a typical situation.

Proposition 4.2. Assume that $m = kn$ for an integer value $k \in \mathbb{N} \setminus \{0, 1\}$.
Let us assume that $X$ and $Z$ are built in the following way: we have a population
$\chi_1 = (\chi_{1,1}, \dots, \chi_{1,p}) \in \mathbb{R}^p, \dots, \chi_m \in \mathbb{R}^p$ (the points of
interest). Then, we draw uniformly without replacement $n$ of the $\chi_i$'s to be put
in $X$: more formally, but equivalently, we draw uniformly a permutation $\sigma$ of
$\{1, \dots, m\}$ and we put
$X = (x_1', \dots, x_n')' = (\chi_{\sigma(1)}', \dots, \chi_{\sigma(n)}')'$ and
$Z = (x_1', \dots, x_m')' = (\chi_{\sigma(1)}', \dots, \chi_{\sigma(m)}')'$.
Let us assume that for any $(i,j) \in \{1, \dots, m\} \times \{1, \dots, p\}$,
$\chi_{i,j}^2 \le \kappa$ for some $\kappa > 0$, and that $p > 2$. Then, with probability at
least $1 - \eta$, for any $u \in \mathbb{R}^p$,



X'X 



-Z'Z) u 



< u 



2nk 
1 fr=T 



2 log 



P 



In particular, if we have 



\\u\\i < ||/3* ||i and k < 



k-l a 

10* Plh 



then we have 



X'X - -Z'Z) u 

m 



< g \ 1 2nlog I — 



Let us just mention that the assumption m = kn is not restrictive. It has 
been introduced for the sake of simplicity. 
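The sampling scheme of Proposition 4.2 can be simulated directly. The sketch below draws a uniform permutation of a population, keeps the first $n$ points as the labeled design $X$, and reports the largest entry (in absolute value) of $X'X - (n/m)Z'Z$. By Hölder's inequality this entry-wise maximum, multiplied by $\|u\|_1$, bounds $\|(X'X - (n/m)Z'Z)u\|_\infty$. The function names and the toy population are assumptions of this sketch, which does not reproduce the constants of the proposition.

```python
import random


def gram(M):
    """Return M'M for a matrix given as a list of rows."""
    p = len(M[0])
    return [[sum(row[a] * row[b] for row in M) for b in range(p)] for a in range(p)]


def gram_gap(population, n, rng):
    """Largest |entry| of X'X - (n/m) Z'Z after a uniform random permutation."""
    m = len(population)
    order = list(range(m))
    rng.shuffle(order)                      # uniform permutation of the indices
    Z = [population[i] for i in order]      # all the points, permuted
    X = Z[:n]                               # the n points that receive a label
    GX, GZ = gram(X), gram(Z)
    p = len(GX)
    return max(abs(GX[a][b] - (n / m) * GZ[a][b]) for a in range(p) for b in range(p))


rng = random.Random(0)
population = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(40)]
print(gram_gap(population, 10, rng))
```

Repeating the draw with different seeds gives a feeling for how representative a labeled sample of size $n$ typically is.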



5 Experimental results 

Implementation. Since the paper of Tibshirani [Tib96], several effective
algorithms to compute the LASSO have been proposed and studied (for instance
Interior Points methods [KKL+07], LARS [EHJT04], Pathwise Coordinate
Optimization [FHHT07], Relaxed Greedy Algorithms [HCB08]). For the Dantzig
Selector, a linear method was proposed in the first paper [CT07]. The LARS
algorithm was also successfully extended in [JRL09] to compute the Dantzig
Selector.

So there are many algorithms to compute $\hat\beta_{A,\lambda}$ and $\tilde\beta_{A,\lambda}$ when
$A = X$. Thanks to Proposition 3.1, it is also clear that we can easily find an
efficient algorithm for the case $A = \sqrt{n}\, I$.

The general form of the estimators $\hat\beta_{A,\lambda}$ and $\tilde\beta_{A,\lambda}$ given by
Definitions 3.1 and 3.2 allows one to use one of the algorithms mentioned
previously to compute our estimator in both cases. For example, from Remark 3.1
we have:

    $\hat\beta_{A,\lambda} \in \arg\min_{\beta \in \mathbb{R}^p} \{ \|\tilde Y_A - A\beta\|_2^2 + 2\lambda \|\Xi_A \beta\|_1 \},$

then we just have to compute $\tilde Y_A$, to put $B = A\,\Xi_A^{-1}$, to use any program
that computes the LASSO to determine

    $\hat\gamma \in \arg\min_{\gamma \in \mathbb{R}^p} \{ \|\tilde Y_A - B\gamma\|_2^2 + 2\lambda \|\gamma\|_1 \}$

and then to put $\hat\beta_{A,\lambda} = \Xi_A^{-1} \hat\gamma$.
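The change of variables described above can be sketched in Python. The LASSO solver used here is a plain cyclic coordinate descent written for this illustration; it is not the PCO implementation used in the experiments, and any LASSO routine could be substituted. The inputs `y_tilde` (standing for $\tilde Y_A$) and `xi_sqrt` (the diagonal of $\Xi_A$) are assumed to have been computed beforehand, and all function names are hypothetical.

```python
def soft_threshold(b, t):
    """Return sgn(b) * max(|b| - t, 0)."""
    shrunk = max(abs(b) - t, 0.0)
    return shrunk if b >= 0 else -shrunk


def lasso_coordinate_descent(B, y, lam, n_sweeps=200):
    """Minimise ||y - B g||_2^2 + 2 * lam * ||g||_1 by cyclic coordinate descent."""
    n_rows, n_cols = len(B), len(B[0])
    g = [0.0] * n_cols
    residual = list(y)  # equals y - B g, since g starts at zero
    col_sq = [sum(B[i][j] ** 2 for i in range(n_rows)) for j in range(n_cols)]
    for _ in range(n_sweeps):
        for j in range(n_cols):
            if col_sq[j] == 0.0:
                continue
            # Inner product of column j with the residual computed without g_j.
            rho = sum(B[i][j] * residual[i] for i in range(n_rows)) + col_sq[j] * g[j]
            new_gj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_gj - g[j]
            if delta != 0.0:
                for i in range(n_rows):
                    residual[i] -= B[i][j] * delta
                g[j] = new_gj
    return g


def generalized_lasso(A, y_tilde, xi_sqrt, lam):
    """Minimise ||y_tilde - A beta||^2 + 2 * lam * ||Xi beta||_1, Xi = diag(xi_sqrt)."""
    # B = A Xi^{-1}: column j of A is divided by xi_sqrt[j].
    B = [[a_ij / s for a_ij, s in zip(row, xi_sqrt)] for row in A]
    gamma = lasso_coordinate_descent(B, y_tilde, lam)
    # Back to the original parametrisation: beta = Xi^{-1} gamma.
    return [g_j / s for g_j, s in zip(gamma, xi_sqrt)]


print(generalized_lasso([[1.0, 0.0], [0.0, 1.0]], [3.0, -0.5], [1.0, 1.0], 1.0))
```

The rescaling requires every diagonal entry of $\Xi_A$ to be non-zero; a fixed number of sweeps is used instead of a convergence test to keep the sketch short.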

In the rest of this section, we compare the LASSO and the transductive 
LASSO on the classical toy example introduced by Tibshirani [Tib96] and used 
as a benchmark. 



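Data description. In the model proposed by Tibshirani, we have

    $Y_i = x_i \beta^* + \varepsilon_i$

for $i \in \{1, \dots, n\}$, $\beta^* \in \mathbb{R}^p$ and the $\varepsilon_i$ are i.i.d.
$\mathcal{N}(0, \sigma^2)$. Finally, the $(x_i)_{i \in \{1, \dots, m\}}$ are generated from a
probability distribution: they are independent and identically distributed

    $\mathcal{N}\left( \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho & \cdots & \rho^{p-1} \\ \rho & 1 & \cdots & \rho^{p-2} \\ \vdots & & \ddots & \vdots \\ \rho^{p-1} & \cdots & \rho & 1 \end{pmatrix} \right)$

for a given $\rho \in\, ]-1, 1[$.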

As in [Tib96], we set $p = 8$. In a first experiment, we take $(n, m) = (7, 10)$,
$\rho = 0.5$, $\sigma = 1$ and $\beta^* = (3, 1.5, 0, 0, 2, 0, 0, 0)$ ("sparse"). Then, in order
to check the robustness of the results, we successively replace $\rho = 0.5$ by
$\rho = 0.9$ (correlated variables), $\sigma = 1$ by $\sigma = 3$ (noisy case),
$\beta^* = (3, 1.5, 0, 0, 2, 0, 0, 0)$ by $\beta^* = (5, 0, 0, 0, 0, 0, 0, 0)$ ("very sparse"
case), and $(n, m) = (7, 10)$ by $(n, m) = (7, 20)$ (larger unlabeled set),
$(n, m) = (20, 30)$ ($p < n$, easy case) and finally $(n, m) = (20, 120)$.
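A data set of this kind can be generated along the following lines. The sketch builds the covariance matrix with entries $\rho^{|i-j|}$, draws the $m$ design points through a Cholesky factor, and labels the first $n$ of them. It is an illustration only: the function names are hypothetical, the column normalization $X_j'X_j/n = 1$ assumed in the theory is not applied, and the random seed is arbitrary.

```python
import math
import random


def ar1_covariance(p, rho):
    """Covariance matrix with entries rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]


def cholesky(S):
    """Lower-triangular L with L L' = S, for a positive definite matrix S."""
    p = len(S)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(S[i][i] - s)
            else:
                L[i][j] = (S[i][j] - s) / L[j][j]
    return L


def simulate(n, m, beta_star, rho, sigma, rng):
    """Return (X, Y, Z): Z holds the m design points, X its first n rows."""
    p = len(beta_star)
    L = cholesky(ar1_covariance(p, rho))
    Z = []
    for _ in range(m):
        g = [rng.gauss(0.0, 1.0) for _ in range(p)]
        # x = L g has covariance L L'.
        Z.append([sum(L[i][k] * g[k] for k in range(i + 1)) for i in range(p)])
    X = Z[:n]
    Y = [sum(x_j * b_j for x_j, b_j in zip(x, beta_star)) + rng.gauss(0.0, sigma)
         for x in X]
    return X, Y, Z


X, Y, Z = simulate(7, 10, [3, 1.5, 0, 0, 2, 0, 0, 0], 0.5, 1.0, random.Random(0))
print(len(X), len(Y), len(Z))
```

Only the first $n$ rows receive a label, which mirrors the transductive setting where $Z$ contains both labeled and unlabeled points.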

We use the version of the Transductive LASSO proposed in Section 4: for
a given $\lambda_1$, we first compute the LASSO estimator $\hat\beta_{X,\lambda_1}$. In the
sequel, the Transductive LASSO is given by

    $\hat\beta^{TL}(\lambda_1, \lambda_2) = \arg\min_{\beta \in \mathbb{R}^p} \frac{n}{m} \|Z\beta\|_2^2$  s.t.  $\|\frac{n}{m} Z'(Z\hat\beta_{X,\lambda_1} - Z\beta)\|_\infty \le \lambda_2,$

for a given $\lambda_2$. We compare this two step procedure with the procedure
obtained using the usual LASSO only: $\hat\beta^{L}(\lambda) = \hat\beta_{X,\lambda}$ for a given
$\lambda$ that may differ from $\lambda_1$. In both cases, the solutions are computed using
the PCO algorithm. We compute $\hat\beta^{L}(\lambda)$ and $\hat\beta^{TL}(\lambda_1, \lambda_2)$ for
$(\lambda, \lambda_1, \lambda_2) \in \Lambda^3$ where $\Lambda = \{1.2^k, k = -50, -49, \dots, 30\}$.
In the next subsection, we examine the performance of each estimator according
to the value of the regularization parameters.
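The geometric grid of tuning parameters $\{1.2^k, k = -50, \dots, 30\}$ can be written down directly; the variable names below are hypothetical.

```python
# Grid of candidate regularization parameters: 1.2 ** k for k = -50, ..., 30.
LAMBDA_GRID = [1.2 ** k for k in range(-50, 31)]

# Pairs (lambda_1, lambda_2) explored for the two-step procedure.
LAMBDA_PAIRS = [(l1, l2) for l1 in LAMBDA_GRID for l2 in LAMBDA_GRID]

print(len(LAMBDA_GRID), len(LAMBDA_PAIRS))
print(LAMBDA_GRID[0], LAMBDA_GRID[-1])
```

A geometric grid spreads the candidates evenly on a logarithmic scale, so very small and very large penalties are both explored with the same relative resolution.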

Results. We illustrate here some of the results obtained in the considered cases.

Case $(n, m) = (7, 10)$, $\rho = 0.5$, $\sigma = 1$ and $\beta^*$ "sparse":

We simulated 100 experiments and studied the distribution of

    $PERF(X) = \frac{\min_{(\lambda_1,\lambda_2) \in \Lambda^2} \|X(\hat\beta^{TL}(\lambda_1,\lambda_2) - \beta^*)\|_2^2}{\min_{\lambda \in \Lambda} \|X(\hat\beta^{L}(\lambda) - \beta^*)\|_2^2},$

    $PERF(Z) = \frac{\min_{(\lambda_1,\lambda_2) \in \Lambda^2} \|Z(\hat\beta^{TL}(\lambda_1,\lambda_2) - \beta^*)\|_2^2}{\min_{\lambda \in \Lambda} \|Z(\hat\beta^{L}(\lambda) - \beta^*)\|_2^2}$

and

    $PERF(I) = \frac{\min_{(\lambda_1,\lambda_2) \in \Lambda^2} \|\hat\beta^{TL}(\lambda_1,\lambda_2) - \beta^*\|_2^2}{\min_{\lambda \in \Lambda} \|\hat\beta^{L}(\lambda) - \beta^*\|_2^2}$

over all the experiments.

For example, we plot (Figure 1) the histogram of PERF(X) (actually, the
three distributions were quite similar). We observe that in 50% of the
simulations,
$\min_{(\lambda_1,\lambda_2) \in \Lambda^2} \|X(\hat\beta^{TL}(\lambda_1,\lambda_2) - \beta^*)\|_2^2 = \min_{(\lambda_1,0) \in \Lambda^2} \|X(\hat\beta^{TL}(\lambda_1,0) - \beta^*)\|_2^2 = \min_{\lambda \in \Lambda} \|X(\hat\beta^{L}(\lambda) - \beta^*)\|_2^2$.
In these cases, the Transductive LASSO does not improve on the LASSO at all.
But in the other 50%, the Transductive LASSO actually improves on the LASSO,
and the improvement is sometimes really important. We give an overview of the
results in Table 1.
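Each PERF criterion is a ratio of two best-case losses, one per estimator. A minimal sketch, assuming the losses have already been evaluated on the grid of tuning parameters (the function name and the numbers are illustrative, not experimental results):

```python
def perf_ratio(transductive_losses, lasso_losses):
    """Best loss of the Transductive LASSO divided by best loss of the LASSO.

    transductive_losses : losses over all pairs (lambda_1, lambda_2)
    lasso_losses        : losses over all values of lambda
    """
    return min(transductive_losses) / min(lasso_losses)


# Made-up loss values, for illustration only.
print(perf_ratio([4.0, 2.0, 3.0], [5.0, 4.0]))
```

A ratio below 1 means that, with oracle tuning, the two-step procedure does better than the LASSO on that experiment; a ratio equal to 1 means both reach the same best loss.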

The other cases:

The following conclusions emerge from the experiments: first,
$\beta^* = (5, 0, \dots, 0)$ leads to a more significant improvement of the Transductive
LASSO compared to the LASSO (Table 1). This good performance of the Transductive
LASSO can also be observed when $(n, m) = (7, 10)$ and $(n, m) = (7, 20)$. However
in the case $n > p$ (easy case), i.e., $(n, m) = (20, 30)$ and $(n, m) = (20, 120)$, the
improvement of the Transductive LASSO with respect to the LASSO becomes
less significant (Table 1).

Finally, $\rho$ and $\sigma$ have of course a significant influence on the performance of
the LASSO. However these parameters do not seem to have any influence on the
relative performance of the Transductive LASSO with respect to the LASSO
(see for instance the three last rows in Table 1, where we kept $(n, m) = (20, 30)$).
Quite surprisingly, the relative performance of both estimators does not strongly
depend on the estimation objective $\beta^*$, $X\beta^*$ or $Z\beta^*$, but on the particular
experiment we deal with. According to the study carried out and for all the
objectives, the Transductive LASSO performs better than the LASSO in about 50%
of the experiments. Otherwise, $\lambda_2 = 0$ is the optimal tuning parameter and then
the LASSO and the Transductive LASSO are equivalent.

Figure 1: Histogram of PERF(X) with $(n, m) = (7, 10)$, $\rho = 0.5$, $\sigma = 1$ and
$\beta^* = (3, 1.5, 0, 0, 2, 0, 0, 0)$.

Also surprising is that as often as not, the minimum in

    $\min_{(\lambda_1,\lambda_2) \in \Lambda^2} \|X(\hat\beta^{TL}(\lambda_1,\lambda_2) - \beta^*)\|_2^2 < \min_{(\lambda_1,0) \in \Lambda^2} \|X(\hat\beta^{TL}(\lambda_1,0) - \beta^*)\|_2^2$

does not significantly depend on $\lambda_1$ for a very large range of values of
$\lambda_1$. This is quite interesting for a practitioner as it means that when we use
the Transductive LASSO, we deal with only a single unknown tuning parameter
(that is $\lambda_2$) and not two.

Discussion on the regularization parameter. Finally, we would like to
point out the importance of the tuning parameter $\lambda$ (in a general sense).
Figure 2 shows the graph of a typical experiment. There are two curves on this
graph, which represent the quantities $(1/n)\|X(\hat\beta^{L}(\lambda) - \beta^*)\|_2^2$ and
$(1/m)\|Z(\hat\beta^{L}(\lambda) - \beta^*)\|_2^2$ with respect to $\lambda$. We observe that the two
functions do not reach their minimum value for the same value of $\lambda$ (the
minimum is highlighted on the graph by a dot), even if these minima are quite
close.

Since we consider variable selection methods, the identification of the true
support $\{j : \beta^*_j \ne 0\}$ of the vector $\beta^*$ is also of concern. One expects that
the estimator $\hat\beta$ and the true vector $\beta^*$ share the same support at least when
$n$ is large enough. This is known as the variable selection consistency problem
and it has been considered for the LASSO estimator in several works (see
[Bun08], [MB06], [MY09], [Wai06], [ZY06]). Recently, [Lou08] provided the variable
selection consistency of the Dantzig Selector. Other popular selection
procedures, based on the LASSO estimator, such as the Adaptive LASSO [Zou06],
the SCAD [FL01], the S-LASSO [Heb08] and the Group-LASSO [Bac08], have
also been studied from a variable selection point of view. Following our
previous work [AH08], it is possible to provide such results for the Transductive
LASSO.

Table 1: Evaluation of the mean ME and the quantile Q3 of order 0.3 of
PERF(I), PERF(X) and PERF(Z). In these experiments, $\sigma$ always equals
1. The case sparse corresponds to $\beta^* = (3, 1.5, 0, 0, 2, 0, 0, 0)$ while the case very
sparse corresponds to $\beta^* = (5, 0, 0, 0, 0, 0, 0, 0)$.

                                          PERF(I)        PERF(X)        PERF(Z)
    $\beta^*$      (n, m)    $\rho$  $\sigma$    ME     Q3      ME     Q3      ME     Q3
    VERY SPARSE  (7, 10)   0.5   1     0.74   0.71    0.76   0.71    0.75   0.70
    SPARSE       (7, 10)   0.5   1     0.83   0.76    0.86   0.80    0.88   0.88
    SPARSE       (7, 20)   0.5   1     0.84   0.79    0.84   0.81    0.88   0.89
    SPARSE       (20, 30)  0.5   1     0.91   0.90    0.93   0.93    0.93   0.95
    SPARSE       (20, 30)  0.9   1     0.91   0.93    0.94   0.95    0.93   0.96
    SPARSE       (20, 30)  0.5   3     0.90   0.89    0.92   0.92    0.92   0.93

The variable selection task has also been illustrated in Figure 2. We reported
the minimal value of $\lambda$ for which the LASSO estimator correctly identifies the
non-zero components of $\beta^*$. This value of $\lambda$ is quite different from the values
that minimize the prediction losses. This observation is recurrent in almost
all the experiments: the estimation of $X\beta^*$, of $Z\beta^*$ and of the support of
$\beta^*$ are three different objectives and have to be treated separately. We cannot
expect in general to find a choice for $\lambda$ which makes the LASSO, for instance,
perform well for all the mentioned objectives simultaneously.

6 Conclusion 

In this paper, we propose an extension of the LASSO and the Dantzig Selector
for which we provide theoretical results with less restrictive hypotheses
than in previous works. These estimators have a nice interpretation in terms
of transductive prediction. Moreover, we study the practical performance of
the proposed transductive estimators on simulated data. It turns out that the
benefit of using such methods is emphasized when the model is sparse and
particularly when the sample sizes ($n$ labeled points and $m$ unlabeled points) and
dimension $p$ are such that $n < p < m$.



Figure 2: Performance vs. $\lambda$.



7 Proofs 

In this section, we state the proofs of our main results. 
7.1 Proof of Propositions EH] and EH 

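Proof of Proposition 3.1. Let us assume that $(X'X)$ is invertible. Then just 
remark that the criterion minimized by $\hat\beta_{\sqrt{n}I,\lambda}$ is, up to an additive term that 
does not depend on $\beta$, 

$$ n\left\|\hat\beta^{LSE} - \beta\right\|_2^2 + 2\lambda\left\|\Xi_{\sqrt{n}I}\,\beta\right\|_1 = \sum_{j=1}^{p}\left[ n\left(\hat\beta_j^{LSE} - \beta_j\right)^2 + 2\lambda\,\xi_j^{1/2}(\sqrt{n}I)\,|\beta_j| \right]. $$

So we can optimize with respect to each coordinate $\beta_j$ individually. It is quite 
easy to check that the solution is, for $\beta_j$, 

$$ \mathrm{sgn}\left(\hat\beta_j^{LSE}\right)\left( \left|\hat\beta_j^{LSE}\right| - \frac{\lambda\,\xi_j^{1/2}(\sqrt{n}I)}{n} \right)_+ . $$

The proof for $\tilde\beta_{\sqrt{n}I,\lambda}$ is also easy as it solves 

$$ \arg\min_{\beta\in\mathbb{R}^p} \left\|\Xi_{\sqrt{n}I}\,\beta\right\|_1 \quad \text{s.t.} \quad \left\|\Xi_{\sqrt{n}I}^{-1}\, n\left(\hat\beta^{LSE} - \beta\right)\right\|_\infty \le \lambda, $$

which is again separable in the coordinates of $\beta$. □ 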
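
The coordinate-wise minimization used in this proof can be checked numerically. The following is a minimal sketch, assuming the separable form $n(b_j - \beta_j)^2 + 2\lambda w_j|\beta_j|$ of the criterion; the numerical values of `b_lse` and `xi_sqrt` (standing for $\hat\beta_j^{LSE}$ and $\xi_j^{1/2}$) are hypothetical, and the helper names are ours. It compares the closed-form soft-thresholding with a brute-force minimization over a fine grid.

```python
import numpy as np

def soft_threshold(b, t):
    """Coordinate-wise soft-thresholding: sgn(b) * (|b| - t)_+."""
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

def criterion(beta_j, b_lse_j, n, lam, xi_sqrt_j):
    """One coordinate of the criterion: n (b - beta)^2 + 2 lam xi^{1/2} |beta|."""
    return n * (b_lse_j - beta_j) ** 2 + 2.0 * lam * xi_sqrt_j * np.abs(beta_j)

n, lam = 50, 10.0
b_lse = np.array([1.5, -0.3, 0.05, -2.0])   # hypothetical least-squares coordinates
xi_sqrt = np.array([1.0, 2.0, 0.5, 1.5])    # hypothetical values of xi_j^{1/2}

# Closed form from the proof: threshold lam * xi_j^{1/2} / n on each coordinate.
closed_form = soft_threshold(b_lse, lam * xi_sqrt / n)

# Brute-force minimization of each one-dimensional criterion on a grid of step 1e-4.
grid = np.linspace(-3.0, 3.0, 60001)
brute = np.array([grid[np.argmin(criterion(grid, b, n, lam, x))]
                  for b, x in zip(b_lse, xi_sqrt)])

print(np.max(np.abs(closed_form - brute)))  # discrepancy, of the order of the grid step at most
```

Coordinates whose least-squares value is below the threshold in absolute value are set exactly to zero, the others are shrunk toward zero by the threshold.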
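Proof of Proposition 3.2. Let us write the Lagrangian of the program 

$$ \arg\min_{\beta\in\mathbb{R}^p} \|A\beta\|_2^2 \quad \text{s.t.} \quad \left\|\Xi_A^{-1}(A'A)\left((\widetilde{X'X})^{-1}X'Y - \beta\right)\right\|_\infty \le \lambda, $$

$$ \mathcal{L}(\beta,\gamma,\mu) = \beta'(A'A)\beta + \gamma'\left[\Xi_A^{-1}(A'A)\left((\widetilde{X'X})^{-1}X'Y - \beta\right) - \lambda E\right] + \mu'\left[\Xi_A^{-1}(A'A)\left(\beta - (\widetilde{X'X})^{-1}X'Y\right) - \lambda E\right] $$

with $E = (1,\ldots,1)'$, and for any $j$, $\gamma_j \ge 0$, $\mu_j \ge 0$ and $\gamma_j\mu_j = 0$. Any solution 
$\beta = \beta(\gamma,\mu)$ must satisfy 

$$ 0 = \frac{\partial\mathcal{L}}{\partial\beta}(\beta,\gamma,\mu) = 2(A'A)\beta + (A'A)\Xi_A^{-1}(\mu - \gamma), $$

so 

$$ (A'A)\beta = (A'A)\,\Xi_A^{-1}\,\frac{\gamma - \mu}{2}. $$

Note that the conditions $\gamma_j \ge 0$, $\mu_j \ge 0$ and $\gamma_j\mu_j = 0$ mean that there is 
a $\zeta_j \in \mathbb{R}$ such that $\zeta_j = \xi_j^{-1/2}(A)(\gamma_j - \mu_j)/2$, $|\zeta_j| = \xi_j^{-1/2}(A)(\gamma_j + \mu_j)/2$, and 
so $\gamma_j = 2\left(\zeta_j/\xi_j^{-1/2}(A)\right)_+$ and $\mu_j = 2\left(\zeta_j/\xi_j^{-1/2}(A)\right)_-$, where $(a)_+ = \max(a;0)$ and 
$(a)_- = \max(-a;0)$. Let also $\zeta$ denote the vector whose $j$-th component is 
exactly $\zeta_j$; we obtain 

$$ (A'A)\beta = (A'A)\zeta, $$

or, using the condition $\mathrm{Ker}(A) = \mathrm{Ker}(X)$, $X\beta = X\zeta$ and $A\beta = A\zeta$. This leads 
to 

$$ -\mathcal{L}(\beta,\gamma,\mu) = -2Y'X(\widetilde{X'X})^{-1}(A'A)\zeta + \zeta'(A'A)\zeta + 2\lambda\|\Xi_A\zeta\|_1, $$

and note that the first order condition also implies that $\gamma$ and $\mu$ (and so $\zeta$) 
maximize $\mathcal{L}$, so that $\zeta$ minimizes the criterion that defines $\hat\beta_{A,\lambda}$. This ends the proof. □ 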
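
One consequence of this duality can be checked numerically in the special case $A = X$: a minimizer of the penalized criterion satisfies the constraint of the program above. The sketch below is only an illustration under stated assumptions. It takes $X'X$ invertible, assumes the weights $\xi_j^{1/2}(X) = ((X'X)_{jj}/n)^{1/2}$, and uses a plain coordinate descent (our choice of solver, not an algorithm taken from this paper) to minimize $\|Y - X\beta\|_2^2 + 2\lambda\sum_j \xi_j^{1/2}|\beta_j|$, which differs from the criterion with $A = X$ only by the constant $\|Y\|_2^2$. The function name `lasso_cd` and the simulated data are hypothetical.

```python
import numpy as np

def lasso_cd(X, Y, lam, w, n_sweeps=500):
    """Coordinate descent for ||Y - X b||^2 + 2 * lam * sum_j w_j * |b_j|."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)          # X_j' X_j for each column
    resid = Y.copy()                         # current residual Y - X beta
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]       # residual without coordinate j
            rho = X[:, j] @ resid
            # Exact one-dimensional minimizer: soft-thresholding of rho.
            beta[j] = np.sign(rho) * max(abs(rho) - lam * w[j], 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]
    return beta

rng = np.random.default_rng(1)
n, p, lam = 60, 8, 15.0
X = rng.standard_normal((n, p))
beta_star = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
Y = X @ beta_star + rng.standard_normal(n)

w = np.sqrt(np.sum(X ** 2, axis=0) / n)      # assumed xi_j^{1/2}(X)
beta_hat = lasso_cd(X, Y, lam, w)

# Constraint of the program with A = X: |X_j'(Y - X beta)| / xi_j^{1/2} <= lam for all j.
kkt = np.abs(X.T @ (Y - X @ beta_hat)) / w
print(np.max(kkt), "<=", lam)
```

The constraint is active (equal to $\lambda$ up to numerical error) on the coordinates where the solution is nonzero and slack on the others.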



7.2 A useful Lemma 

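The following lemma will be used in the proofs of Theorems 3.3 and 3.4. 

Lemma 7.1. Let us put $\varepsilon = (\varepsilon_1,\ldots,\varepsilon_n)'$. If $\mathrm{Ker}(A) = \mathrm{Ker}(X)$ we have, with 
probability at least $1-\eta$, 

$$ \forall j\in\{1,\ldots,p\}, \quad \left|\left[(A'A)(\widetilde{X'X})^{-1}X'\varepsilon\right]_j\right| \le \xi_j^{1/2}(A)\,\sigma\sqrt{2n\log\frac{p}{\eta}}, $$

or, in other words, 

$$ \left\|\Xi_A^{-1}(A'A)\left((\widetilde{X'X})^{-1}X'Y - \beta^*\right)\right\|_\infty \le \sigma\sqrt{2n\log\frac{p}{\eta}}. $$

Proof of the lemma. By definition, $\varepsilon \sim \mathcal{N}(0,\sigma^2 I)$ and so 

$$ (A'A)(\widetilde{X'X})^{-1}X'\varepsilon \sim \mathcal{N}\left(0, \sigma^2 (A'A)(\widetilde{X'X})^{-1}(A'A)\right). $$

So, for all $j$, $\left[(A'A)(\widetilde{X'X})^{-1}X'\varepsilon\right]_j$ comes from a centered Gaussian distribution with 
variance $\sigma^2\left[(A'A)(\widetilde{X'X})^{-1}(A'A)\right]_{j,j} = n\sigma^2\xi_j(A)$. Together with a union bound 
over $j$, this implies the first point. The second one is trivial using $Y = X\beta^* + \varepsilon$. □ 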
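
The bound of Lemma 7.1 can be illustrated by simulation. The sketch below is a sanity check under explicit assumptions, not part of the proof: it takes $X'X$ invertible (so the pseudo-inverse is the usual inverse), draws a hypothetical matrix `A` with trivial kernel, assumes the definition $\xi_j(A) = \frac{1}{n}[(A'A)(X'X)^{-1}(A'A)]_{j,j}$, and estimates how often at least one coordinate exceeds its bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma, eta = 40, 20, 1.0, 0.1
X = rng.standard_normal((n, p))            # full column rank: Ker(X) = {0}
A = rng.standard_normal((30, p))           # hypothetical A with Ker(A) = {0}

XtX_inv = np.linalg.inv(X.T @ X)
AtA = A.T @ A
# Assumed definition: xi_j(A) = (1/n) * [(A'A)(X'X)^{-1}(A'A)]_{jj}
xi = np.diag(AtA @ XtX_inv @ AtA) / n
bound = np.sqrt(xi) * sigma * np.sqrt(2 * n * np.log(p / eta))

n_trials = 2000
eps = sigma * rng.standard_normal((n_trials, n))
# Row t holds |[(A'A)(X'X)^{-1}X' eps_t]_j| for j = 1..p (the matrices are symmetric).
stat = np.abs(eps @ X @ XtX_inv @ AtA)
failure_rate = np.mean(np.any(stat > bound, axis=1))
print("empirical failure frequency:", failure_rate, "(the lemma allows at most", eta, ")")
```

Since the lemma relies on a union bound, the empirical failure frequency is expected to be well below $\eta$.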

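7.3 Proof of Theorems 3.3 and 3.4 

Proof of Theorem 3.3. By definition of $\hat\beta_{A,\lambda}$ we have 

$$ -2Y'X(\widetilde{X'X})^{-1}(A'A)\hat\beta_{A,\lambda} + (\hat\beta_{A,\lambda})'(A'A)\hat\beta_{A,\lambda} + 2\lambda\|\Xi_A\hat\beta_{A,\lambda}\|_1 \le -2Y'X(\widetilde{X'X})^{-1}(A'A)\beta^* + (\beta^*)'(A'A)\beta^* + 2\lambda\|\Xi_A\beta^*\|_1. $$

Since $Y = X\beta^* + \varepsilon$, we obtain 

$$ 2(\beta^*)'X'X(\widetilde{X'X})^{-1}(A'A)(\beta^* - \hat\beta_{A,\lambda}) + (\hat\beta_{A,\lambda})'(A'A)\hat\beta_{A,\lambda} - (\beta^*)'(A'A)\beta^* + 2\varepsilon'X(\widetilde{X'X})^{-1}(A'A)(\beta^* - \hat\beta_{A,\lambda}) \le 2\lambda\|\Xi_A\beta^*\|_1 - 2\lambda\|\Xi_A\hat\beta_{A,\lambda}\|_1. $$

Now, if $\mathrm{Ker}(X) = \mathrm{Ker}(A)$ then we have $X'X(\widetilde{X'X})^{-1}(A'A) = (A'A)$ and then 
the previous inequality leads to 

$$ (\beta^* - \hat\beta_{A,\lambda})'(A'A)(\beta^* - \hat\beta_{A,\lambda}) \le 2\varepsilon'X(\widetilde{X'X})^{-1}(A'A)(\hat\beta_{A,\lambda} - \beta^*) + 2\lambda\|\Xi_A\beta^*\|_1 - 2\lambda\|\Xi_A\hat\beta_{A,\lambda}\|_1. \qquad (9) $$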
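
The identity $X'X(\widetilde{X'X})^{-1}(A'A) = (A'A)$ used to reach (9) can be verified numerically. The sketch below assumes that the pseudo-inverse is the Moore-Penrose one (one admissible choice) and builds a hypothetical matrix `A = B X` with `B` invertible, which guarantees $\mathrm{Ker}(A) = \mathrm{Ker}(X)$; the dimensions are chosen with $n < p$ so that $X'X$ is singular.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 8                                  # n < p, so X'X is singular
X = rng.standard_normal((n, p))
B = rng.standard_normal((n, n))              # invertible with probability one
A = B @ X                                    # hence Ker(A) = Ker(X)

XtX = X.T @ X
AtA = A.T @ A
# Moore-Penrose pseudo-inverse, with an explicit cutoff for the null singular values.
XtX_pinv = np.linalg.pinv(XtX, rcond=1e-10, hermitian=True)

lhs = XtX @ XtX_pinv @ AtA
print(np.max(np.abs(lhs - AtA)))             # numerically negligible
```

The product $X'X(\widetilde{X'X})^{-1}$ is then the orthogonal projector onto the row space of $X$, which contains the range of $A'A$ as soon as the two kernels coincide.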

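Now we have to work on the term $2\varepsilon'X(\widetilde{X'X})^{-1}(A'A)(\hat\beta_{A,\lambda} - \beta^*)$. Note that 

$$\begin{aligned}
2\varepsilon'X(\widetilde{X'X})^{-1}(A'A)(\hat\beta_{A,\lambda} - \beta^*)
&= 2\sum_{j=1}^{p}\left[(A'A)(\widetilde{X'X})^{-1}X'\varepsilon\right]_j\left[(\hat\beta_{A,\lambda})_j - \beta^*_j\right] \\
&\le 2\sum_{j=1}^{p}\left|\left[(A'A)(\widetilde{X'X})^{-1}X'\varepsilon\right]_j\right|\,\left|(\hat\beta_{A,\lambda})_j - \beta^*_j\right| \\
&\le 2\sigma\sqrt{2n\log\frac{p}{\eta}}\,\sum_{j=1}^{p}\xi_j^{1/2}(A)\left|(\hat\beta_{A,\lambda})_j - \beta^*_j\right|
\end{aligned}$$

with probability at least $1-\eta$, by Lemma 7.1. We plug this result into Inequality 
(9) (and replace $\lambda$ by its value $2\sigma\sqrt{2n\log(p/\eta)}$) to obtain 

$$ (\beta^* - \hat\beta_{A,\lambda})'(A'A)(\beta^* - \hat\beta_{A,\lambda}) \le 2\sigma\sqrt{2n\log\frac{p}{\eta}}\sum_{j=1}^{p}\xi_j^{1/2}(A)\left|(\hat\beta_{A,\lambda})_j - \beta^*_j\right| + 4\sigma\sqrt{2n\log\frac{p}{\eta}}\sum_{j=1}^{p}\xi_j^{1/2}(A)\left(|\beta^*_j| - |(\hat\beta_{A,\lambda})_j|\right) $$

and then 

$$\begin{aligned}
(\beta^* - \hat\beta_{A,\lambda})'(A'A)(\beta^* - \hat\beta_{A,\lambda}) &+ 2\sigma\sqrt{2n\log\frac{p}{\eta}}\sum_{j=1}^{p}\xi_j^{1/2}(A)\left|(\hat\beta_{A,\lambda})_j - \beta^*_j\right| \\
&\le 4\sigma\sqrt{2n\log\frac{p}{\eta}}\sum_{j=1}^{p}\xi_j^{1/2}(A)\left(\left|(\hat\beta_{A,\lambda})_j - \beta^*_j\right| + |\beta^*_j| - |(\hat\beta_{A,\lambda})_j|\right) \\
&\le 8\sigma\sqrt{2n\log\frac{p}{\eta}}\sum_{j:\,\beta^*_j\neq 0}\xi_j^{1/2}(A)\left|(\hat\beta_{A,\lambda})_j - \beta^*_j\right|. \qquad (10)
\end{aligned}$$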



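This implies, in particular, that $\beta^* - \hat\beta_{A,\lambda}$ is an admissible vector $\alpha$ in 
Assumption $H(A'A, 3)$ because 

$$ \sum_{j=1}^{p}\xi_j^{1/2}(A)\left|(\hat\beta_{A,\lambda})_j - \beta^*_j\right| \le 4\sum_{j:\,\beta^*_j\neq 0}\xi_j^{1/2}(A)\left|(\hat\beta_{A,\lambda})_j - \beta^*_j\right|. $$

On the other hand, thanks to Inequality (10), we have 

$$\begin{aligned}
(\beta^* - \hat\beta_{A,\lambda})'(A'A)(\beta^* - \hat\beta_{A,\lambda})
&\le 6\sigma\sqrt{2n\log\frac{p}{\eta}}\sum_{j:\,\beta^*_j\neq 0}\xi_j^{1/2}(A)\left|(\hat\beta_{A,\lambda})_j - \beta^*_j\right| \\
&\le 6\sigma\sqrt{2n\log\left(\frac{p}{\eta}\right)\sum_{j:\,\beta^*_j\neq 0}\xi_j(A)\sum_{j:\,\beta^*_j\neq 0}\left[(\hat\beta_{A,\lambda})_j - \beta^*_j\right]^2} \\
&\le 6\sigma\sqrt{\frac{2}{c(A'A)}\,(\beta^* - \hat\beta_{A,\lambda})'(A'A)(\beta^* - \hat\beta_{A,\lambda})\sum_{j:\,\beta^*_j\neq 0}\xi_j(A)\log\left(\frac{p}{\eta}\right)}, \qquad (11)
\end{aligned}$$

where we used the Cauchy-Schwarz inequality and then Assumption $H(A'A, 3)$ for the last inequality. Then 

$$ (\beta^* - \hat\beta_{A,\lambda})'(A'A)(\beta^* - \hat\beta_{A,\lambda}) \le \frac{72\,\sigma^2}{c(A'A)}\log\left(\frac{p}{\eta}\right)\sum_{j:\,\beta^*_j\neq 0}\xi_j(A). \qquad (12) $$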
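A similar reasoning as in (11) leads to 

$$ 2\sigma\sqrt{2n\log\frac{p}{\eta}}\sum_{j=1}^{p}\xi_j^{1/2}(A)\left|(\hat\beta_{A,\lambda})_j - \beta^*_j\right| \le 8\sigma\sqrt{\frac{2}{c(A'A)}\,(\beta^* - \hat\beta_{A,\lambda})'(A'A)(\beta^* - \hat\beta_{A,\lambda})\sum_{j:\,\beta^*_j\neq 0}\xi_j(A)\log\left(\frac{p}{\eta}\right)}. $$

Finally, combine this last inequality with (12) to obtain the desired bound for 
$\|\Xi_A(\beta^* - \hat\beta_{A,\lambda})\|_1$. This ends the proof. □ 

Proof of Theorem 3.4. We have 

$$\begin{aligned}
(\tilde\beta_{A,\lambda} - \beta^*)'(A'A)(\tilde\beta_{A,\lambda} - \beta^*)
&= \left[\Xi_A(\tilde\beta_{A,\lambda} - \beta^*)\right]'\Xi_A^{-1}(A'A)(\tilde\beta_{A,\lambda} - \beta^*) \\
&\le \left\|\Xi_A(\tilde\beta_{A,\lambda} - \beta^*)\right\|_1\left\|\Xi_A^{-1}(A'A)(\tilde\beta_{A,\lambda} - \beta^*)\right\|_\infty \\
&\le \left\|\Xi_A(\tilde\beta_{A,\lambda} - \beta^*)\right\|_1\left[\left\|\Xi_A^{-1}(A'A)\left((\widetilde{X'X})^{-1}X'Y - \beta^*\right)\right\|_\infty + \left\|\Xi_A^{-1}(A'A)\left((\widetilde{X'X})^{-1}X'Y - \tilde\beta_{A,\lambda}\right)\right\|_\infty\right]. \qquad (13)
\end{aligned}$$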



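By the constraint in the definition of $\tilde\beta_{A,\lambda}$ we have 

$$ \left\|\Xi_A^{-1}(A'A)\left((\widetilde{X'X})^{-1}X'Y - \tilde\beta_{A,\lambda}\right)\right\|_\infty \le \lambda, $$

while Lemma 7.1 implies that for $\lambda = 2\sigma\sqrt{2n\log(p/\eta)}$ we have 

$$ \left\|\Xi_A^{-1}(A'A)\left((\widetilde{X'X})^{-1}X'Y - \beta^*\right)\right\|_\infty \le \frac{\lambda}{2} $$

with probability at least $1-\eta$; and so: 

$$ (\tilde\beta_{A,\lambda} - \beta^*)'(A'A)(\tilde\beta_{A,\lambda} - \beta^*) \le \frac{3\lambda}{2}\left\|\Xi_A(\tilde\beta_{A,\lambda} - \beta^*)\right\|_1. $$

Moreover note that, by definition, 

$$\begin{aligned}
0 &\le \|\Xi_A\beta^*\|_1 - \|\Xi_A\tilde\beta_{A,\lambda}\|_1 \\
&= \sum_{j:\,\beta^*_j\neq 0}\xi_j^{1/2}(A)\,|\beta^*_j| - \sum_{j:\,\beta^*_j\neq 0}\xi_j^{1/2}(A)\,|(\tilde\beta_{A,\lambda})_j| - \sum_{j:\,\beta^*_j = 0}\xi_j^{1/2}(A)\,|(\tilde\beta_{A,\lambda})_j| \\
&\le \sum_{j:\,\beta^*_j\neq 0}\xi_j^{1/2}(A)\left|\beta^*_j - (\tilde\beta_{A,\lambda})_j\right| - \sum_{j:\,\beta^*_j = 0}\xi_j^{1/2}(A)\left|\beta^*_j - (\tilde\beta_{A,\lambda})_j\right|,
\end{aligned}$$

and this implies that $\beta^* - \tilde\beta_{A,\lambda}$ is an admissible vector in the relation that defines 
Assumption $H(A'A, 1)$. Let us combine this result with Inequality (13). We 
obtain 

$$\begin{aligned}
(\tilde\beta_{A,\lambda} - \beta^*)'(A'A)(\tilde\beta_{A,\lambda} - \beta^*)
&\le \frac{3\lambda}{2}\left\|\Xi_A(\beta^* - \tilde\beta_{A,\lambda})\right\|_1 \\
&\le 3\lambda\sum_{j:\,\beta^*_j\neq 0}\xi_j^{1/2}(A)\left|\beta^*_j - (\tilde\beta_{A,\lambda})_j\right| \\
&\le 3\lambda\sqrt{\sum_{j:\,\beta^*_j\neq 0}\xi_j(A)\sum_{j:\,\beta^*_j\neq 0}\left[\beta^*_j - (\tilde\beta_{A,\lambda})_j\right]^2} \\
&\le 3\lambda\sqrt{\frac{\sum_{j:\,\beta^*_j\neq 0}\xi_j(A)}{n\,c(A'A)}\,(\tilde\beta_{A,\lambda} - \beta^*)'(A'A)(\tilde\beta_{A,\lambda} - \beta^*)}. \qquad (14)
\end{aligned}$$

So we have 

$$ (\tilde\beta_{A,\lambda} - \beta^*)'(A'A)(\tilde\beta_{A,\lambda} - \beta^*) \le 9\lambda^2\,\frac{1}{n\,c(A'A)}\sum_{j:\,\beta^*_j\neq 0}\xi_j(A), $$

and as a consequence, Inequality (14) gives the upper bound on $\|\Xi_A(\tilde\beta_{A,\lambda} - \beta^*)\|_1$, 
and this ends the proof. □ 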
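
The first step of this proof is a weighted Hölder inequality, $\alpha'(A'A)\alpha \le \|\Xi_A\alpha\|_1\,\|\Xi_A^{-1}(A'A)\alpha\|_\infty$, valid for any positive diagonal weights. The short sketch below checks it on random data; the matrix `A`, the weights `xi_sqrt` and the vector `alpha` are hypothetical and are not tied to the definition of $\xi_j(A)$.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 6
A = rng.standard_normal((10, p))
M = A.T @ A                                  # plays the role of A'A
xi_sqrt = rng.uniform(0.5, 2.0, size=p)      # hypothetical positive weights xi_j^{1/2}
alpha = rng.standard_normal(p)               # plays the role of the estimation error

quad = alpha @ M @ alpha                                     # alpha' (A'A) alpha
l1_part = np.sum(xi_sqrt * np.abs(alpha))                    # ||Xi alpha||_1
sup_part = np.max(np.abs(M @ alpha) / xi_sqrt)               # ||Xi^{-1} (A'A) alpha||_inf
print(quad, "<=", l1_part * sup_part)
```

The weights cancel inside the scalar product, which is why the same rescaling $\Xi_A$ appears in the $\ell_1$ norm of the error and in the constraint of the estimator.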

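7.4 Proof of Theorem 4.1 

Proof of Theorem 4.1. The proof is almost the same as in the previous case. 
For the sake of simplicity, let us omit the subscripts of the two estimators and 
write $\tilde\beta^*$ for the Dantzig Selector version and $\hat\beta^*$ for the LASSO version. 
We first have a look at the Dantzig Selector: 

$$\begin{aligned}
\frac{n}{m}(\tilde\beta^* - \beta^*)'Z'Z(\tilde\beta^* - \beta^*)
&\le \left\|\tilde\beta^* - \beta^*\right\|_1\left\|\frac{n}{m}Z'Z(\tilde\beta^* - \beta^*)\right\|_\infty \\
&\le \left\|\tilde\beta^* - \beta^*\right\|_1\left[\left\|\frac{n}{m}Z'\left(Z\tilde\beta^* - \tilde Y_{\lambda_1}\right)\right\|_\infty + \left\|\frac{n}{m}Z'\left(Z\beta^* - \tilde Y_{\lambda_1}\right)\right\|_\infty\right] \\
&\le \left\|\tilde\beta^* - \beta^*\right\|_1\left[\left\|\frac{n}{m}Z'\left(Z\tilde\beta^* - \tilde Y_{\lambda_1}\right)\right\|_\infty + \left\|X'(X\beta^* - Y)\right\|_\infty + \left\|X'\left(X\tilde\beta_{X,\lambda_1} - Y\right)\right\|_\infty + \left\|\left(\frac{n}{m}Z'Z - X'X\right)\left(\beta^* - \tilde\beta_{X,\lambda_1}\right)\right\|_\infty\right]. \qquad (15)
\end{aligned}$$

By Lemma 7.1, for $\lambda_1 = 10^{-1}\sigma\sqrt{2n\log(p/\eta)}$ we have 

$$ \|X'Y - X'X\beta^*\|_\infty \le 10\lambda_1, $$

with probability at least $1-\eta$. On the other hand, we have 

$$ \left\|\beta^* - \tilde\beta_{X,\lambda_1}\right\|_1 \le \|\beta^*\|_1 + \left\|\tilde\beta_{X,\lambda_1}\right\|_1 \le 2\|\beta^*\|_1, $$

by definition of the Dantzig Selector. Then, let $u = (\beta^* - \tilde\beta_{X,\lambda_1})/2$ and use 
Inequality (8) for this specific $u$. This ensures that 

$$ \left\|\left(\frac{n}{m}Z'Z - X'X\right)\left(\beta^* - \tilde\beta_{X,\lambda_1}\right)\right\|_\infty \le 2\lambda_1. \qquad (16) $$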



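The definition of the Dantzig Selector also implies that 

$$ \left\|X'\left(X\tilde\beta_{X,\lambda_1} - Y\right)\right\|_\infty \le \lambda_1, $$

and finally the definition of the estimator leads to 

$$ \left\|\frac{n}{m}Z'\left(Z\tilde\beta^* - \tilde Y_{\lambda_1}\right)\right\|_\infty \le \lambda_2 = \lambda_1, $$

and as a consequence, Inequality (15) becomes 

$$ \frac{n}{m}(\tilde\beta^* - \beta^*)'Z'Z(\tilde\beta^* - \beta^*) \le 14\lambda_1\left\|\tilde\beta^* - \beta^*\right\|_1. $$

Using the fact that $\|\tilde\beta^*\|_1 \le \|\beta^*\|_1$ gives 

$$\begin{aligned}
\frac{n}{m}(\tilde\beta^* - \beta^*)'Z'Z(\tilde\beta^* - \beta^*)
&\le 14\lambda_1\left\|\tilde\beta^* - \beta^*\right\|_1 \\
&\le 28\lambda_1\sum_{j:\,\beta^*_j\neq 0}\left|\tilde\beta^*_j - \beta^*_j\right| \\
&\le 28\lambda_1\sqrt{\left|\{j : \beta^*_j\neq 0\}\right|\sum_{j:\,\beta^*_j\neq 0}\left(\tilde\beta^*_j - \beta^*_j\right)^2} \\
&\le 28\lambda_1\sqrt{\left|\{j : \beta^*_j\neq 0\}\right|\,\frac{1}{n\,c\left(\frac{n}{m}Z'Z\right)}\,\frac{n}{m}(\tilde\beta^* - \beta^*)'Z'Z(\tilde\beta^* - \beta^*)}. \qquad (17)
\end{aligned}$$

To establish the last inequality, we used Assumption $H'((n/m)Z'Z, 1)$. Then 
we have 

$$ \frac{n}{m}(\tilde\beta^* - \beta^*)'Z'Z(\tilde\beta^* - \beta^*) \le 28^2\lambda_1^2\left|\{j : \beta^*_j\neq 0\}\right|\,\frac{1}{n\,c\left(\frac{n}{m}Z'Z\right)}. $$

This inequality, combined with (17), ends the proof for the Dantzig Selector. 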

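Now, let us deal with the LASSO case. The dual form of the definition of 
the estimator leads to 

$$ -2\frac{n}{m}\tilde Y_{\lambda_1}'Z\hat\beta^* + \frac{n}{m}(\hat\beta^*)'Z'Z\hat\beta^* + 40\lambda_2\|\hat\beta^*\|_1 \le -2\frac{n}{m}\tilde Y_{\lambda_1}'Z\beta^* + \frac{n}{m}(\beta^*)'Z'Z\beta^* + 40\lambda_2\|\beta^*\|_1 $$

and so 

$$ -2\frac{n}{m}\tilde\beta_{X,\lambda_1}'Z'Z\hat\beta^* + \frac{n}{m}(\hat\beta^*)'Z'Z\hat\beta^* + 40\lambda_2\|\hat\beta^*\|_1 \le -2\frac{n}{m}\tilde\beta_{X,\lambda_1}'Z'Z\beta^* + \frac{n}{m}(\beta^*)'Z'Z\beta^* + 40\lambda_2\|\beta^*\|_1. $$

As a consequence, 

$$ \frac{n}{m}(\hat\beta^* - \beta^*)'Z'Z(\hat\beta^* - \beta^*) \le 2\frac{n}{m}(\hat\beta^* - \beta^*)'Z'Z\left(\tilde\beta_{X,\lambda_1} - \beta^*\right) + 40\lambda_2\left(\|\beta^*\|_1 - \|\hat\beta^*\|_1\right). $$

Now, we try to upper bound $(\hat\beta^* - \beta^*)'Z'Z(\tilde\beta_{X,\lambda_1} - \beta^*)$. We remark that 

$$\begin{aligned}
\frac{n}{m}(\hat\beta^* - \beta^*)'Z'Z\left(\tilde\beta_{X,\lambda_1} - \beta^*\right)
&\le \left\|\hat\beta^* - \beta^*\right\|_1\left\|\frac{n}{m}(Z'Z)\left(\tilde\beta_{X,\lambda_1} - \beta^*\right)\right\|_\infty \\
&\le \left\|\hat\beta^* - \beta^*\right\|_1\left[\left\|\left(\frac{n}{m}Z'Z - X'X\right)\left(\tilde\beta_{X,\lambda_1} - \beta^*\right)\right\|_\infty + \left\|X'X\left(\tilde\beta_{X,\lambda_1} - \beta^*\right)\right\|_\infty\right] \\
&\le 13\lambda_1\left\|\hat\beta^* - \beta^*\right\|_1,
\end{aligned}$$

where we used (16) and the fact that 

$$ \left\|X'X\left(\tilde\beta_{X,\lambda_1} - \beta^*\right)\right\|_\infty \le \left\|X'\left(X\tilde\beta_{X,\lambda_1} - Y\right)\right\|_\infty + \|X'\varepsilon\|_\infty \le \lambda_1 + 10\lambda_1 = 11\lambda_1. $$

Then we have 

$$ \frac{n}{m}(\hat\beta^* - \beta^*)'Z'Z(\hat\beta^* - \beta^*) \le 26\lambda_1\left\|\hat\beta^* - \beta^*\right\|_1 + 40\lambda_2\left(\|\beta^*\|_1 - \|\hat\beta^*\|_1\right), $$

and so, since $\lambda_2 = \lambda_1$, 

$$ \frac{n}{m}(\hat\beta^* - \beta^*)'Z'Z(\hat\beta^* - \beta^*) + 14\lambda_1\left\|\hat\beta^* - \beta^*\right\|_1 \le 40\lambda_1\left(\left\|\hat\beta^* - \beta^*\right\|_1 + \|\beta^*\|_1 - \|\hat\beta^*\|_1\right). $$

Up to a multiplicative constant, the rest of the proof of Theorem 4.1 is the same 
as the last lines in the proof of Theorem 3.3, so we omit it here. □ 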



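7.5 Proof of Proposition 4.2 

Proof of Proposition 4.2. First, let us remark that, writing $X_i$ and $Z_i$ for the 
$i$-th columns of $X$ and $Z$, 

$$\begin{aligned}
\left\|\left(X'X - \frac{n}{m}Z'Z\right)u\right\|_\infty
&= n\sup_{1\le i\le p}\left|\sum_{j=1}^{p}\left(\frac{X_i'X_j}{n} - \frac{Z_i'Z_j}{m}\right)u_j\right| \\
&\le n\,\|u\|_1\sup_{1\le i,j\le p}\left|\frac{X_i'X_j}{n} - \frac{Z_i'Z_j}{m}\right|.
\end{aligned}$$

Now, using the "exchangeable-distribution inequality" in [Cat07] we obtain, for 
a given pair $(i,j)$, for any $\tau > 0$, with probability at least $1-\eta$, 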



X i X j Z i Z i 



< 



Tk 2 



m 2n(k + l) 2 



log 1 



< 



rk 2 K 2 logi K fc /2l0gi 



2n(fc + l) 5 



fc- 1 



for $\tau = (\log(1/\eta)(k-1)2n/(kK^2))^{1/2}$ and so, by a union bound argument, with 
probability at least $1-\eta$, for any pair $(i,j)$, 



771 



< 



/cfc / 2 log 



2;,- 



fe - 1 



< 



2nk 21og£ 



jfe-1 



(where we used p > 2). 



□ 